
Issue Info:
  • Year: 2023
  • Volume: 20
  • Issue: 3
  • Pages: 103-126
Measures:
  • Citations: 0
  • Views: 142
  • Downloads: 68
Abstract: 

Measuring the similarity between two text snippets is an essential task in many NLP problems, and it remains one of the most challenging tasks in the field. Various methods have been proposed to measure text similarity. This survey reviews more than 150 related papers, introduces a comprehensive taxonomy with three main categories, and discusses the advantages and disadvantages of these methods. The first category is lexical methods, which focus only on the surface similarity of a text pair. These methods treat the text as a sequence of characters, tokens, or a mixture of the two. Some recent studies use deep learning techniques to detect lexical similarity in the alias detection task. The second category is semantic methods, which take into account the meaning of words, either through pre-built knowledge bases such as WordNet or through corpus-based methods. Some recent studies use modern deep learning techniques such as transformers and Siamese networks to create document embeddings that outperform other methods. The final category is hybrid methods, which combine the other approaches and in some cases also use syntactic parsing. Note that high-quality syntactic parsers are not available for many languages, and that using them has side effects on performance and speed.
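As a rough illustration of the lexical category, the sketch below compares two texts as token sets and as character n-gram sets using Jaccard overlap. The function names, the whitespace tokenization, and the choice of Jaccard are assumptions made for this example; the survey covers many other lexical measures.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def token_similarity(t1, t2):
    """Surface similarity over whitespace-separated tokens (an assumption)."""
    return jaccard(set(t1.lower().split()), set(t2.lower().split()))


def char_similarity(t1, t2, n=3):
    """Surface similarity over character n-grams."""
    return jaccard(char_ngrams(t1, n), char_ngrams(t2, n))
```

Character n-grams tolerate small spelling variations that token matching misses, which is one reason lexical methods mix the two views.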

Issue Info:
  • Year: 2019
  • Volume: 17
  • Issue: 1
  • Pages: 17-31
Measures:
  • Citations: 0
  • Views: 208
  • Downloads: 108
Abstract: 

This article presents an empirical evaluation of the distributional semantic power of three text levels (abstract, body, and full text) in predicting semantic similarity, using a collection of open-access articles from PubMed. Semantic similarity is measured by two criteria: linear MeSH term intersection and hierarchical MeSH term distance. A random sample of 200 queries and 20,000 documents is selected from a test collection built with the CITREC open-source code. The SimPack Java library is used to calculate the textual and semantic similarities. The nDCG value for each of the two semantic similarity criteria is calculated at three precision points. Finally, the nDCG values are compared using the Friedman test to determine the power of each text level in predicting semantic similarity. The results show that text is effective in representing semantic similarity: texts with maximum textual similarity are also 77% and 67% semantically similar under the linear and hierarchical criteria, respectively. Furthermore, text length is found to be more effective in representing hierarchical semantic similarity than linear semantic similarity. Based on the findings, it is concluded that when the subjects are homogeneous in the tree of knowledge, abstracts provide effective semantic capabilities, while in heterogeneous milieus, full-text processing or knowledge bases are needed to achieve IR effectiveness.
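For readers unfamiliar with nDCG, a minimal sketch of the metric at a cutoff k follows. It assumes the common log2 discount and graded relevance scores listed in ranked order; the abstract does not state which nDCG variant or which three cutoffs the authors used, so the function names and any cutoffs chosen by a caller are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

In a setup like the one described, `relevances` would hold the semantic similarity (for example, a MeSH-based score) of each document in the order produced by textual similarity ranking.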

Issue Info:
  • Year: 2021
  • Volume: 7
Measures:
  • Views: 78
  • Downloads: 0
Abstract: 

Graphs and graph databases are applicable over a wide range of domains, including text mining and web mining. Using graphs to represent relationships between entities provides enriched models for emerging tasks in web search and information retrieval. Natural language processing algorithms use graphs to model the structural relationships of texts efficiently, resulting in improved performance. However, increasing the accuracy of graph construction and weight allocation remains a fundamental challenge. Existing methods for these tasks provide limited efficiency and lack scalability for large graphs. In this study, we propose a novel graph-based method for modeling text and running a query to evaluate the similarity of text segments. In this method, the graph corresponding to the text is first created by modeling words and named entities with the state-of-the-art pre-trained BERT model. Graph nodes are then weighted in two stages. In the first stage, more general nodes receive higher weights. The second weighting stage uses the graph obtained from the query text: nodes are considered important if they are specifically related to the query text. After the important nodes in the graph are determined, the semantic similarity between the query text and the texts in the database is measured. The whole framework runs as a natural language processing pipeline on the scalable Apache Spark platform. The efficiency of the model was evaluated in both distributed and non-distributed configurations, along with its scalability on a Spark cluster. Accuracy evaluation using the Pearson correlation coefficient shows that the proposed method outperforms its competitors.
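The accuracy evaluation mentioned at the end of the abstract compares predicted similarity scores against reference scores with the Pearson correlation coefficient. A minimal sketch of that coefficient is shown here; the function name and the input lists are hypothetical, and nothing in it reflects the authors' actual data or pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)
```

A value near 1 would mean the model's similarity scores rise and fall together with the reference judgments.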

Author(s): Majma Negar | Bashtin Sara

Journal: SOFT COMPUTING

Issue Info:
  • Year: 2022
  • Volume: 11
  • Issue: 1
  • Pages: 0-0
Measures:
  • Citations: 0
  • Views: 41
  • Downloads: 0
Abstract: 

In the last decade, with the expansion of the World Wide Web, the speed and ease of access to ideas, documents, articles, manuscripts, and data collected by others have increased. This has made the exchange of information and ideas between researchers and producers of science easier, but it has also made it easier to make unauthorized copies, write summaries without citing the source, and plagiarize texts in general. Since universities and educational centers make scientific and research resources easily available to most users, verifying the authenticity of scientific texts in these centers is more important and, of course, more sensitive. This research presents a method that compares related parts of documents by dividing the documents into blocks. In the proposed method, the documents are first classified into two categories, main documents and suspicious documents, and then preprocessed to eliminate stop words and new wording. The documents are then segmented, and cosine similarity is used to determine the degree of similarity between the texts. In a test on 50 documents from the data set, the proposed method achieves an accuracy of 94%, an improvement of 2% over one of the similar methods.
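The pipeline described above (stop-word removal, segmentation into blocks, cosine similarity between blocks) can be sketched roughly as follows. The tiny stop-word list, the fixed block size, the term-frequency vectors, and all function names are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def segment(tokens, size):
    """Split a token list into consecutive blocks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(block_a, block_b):
    """Cosine similarity between the term-frequency vectors of two blocks."""
    va, vb = Counter(block_a), Counter(block_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_block_similarity(source, suspicious, size=20):
    """Highest cosine similarity over all pairs of source and suspicious blocks."""
    src_blocks = segment(preprocess(source), size)
    sus_blocks = segment(preprocess(suspicious), size)
    return max((cosine(s, t) for s in src_blocks for t in sus_blocks), default=0.0)
```

Comparing block by block lets a copied passage stand out even when the rest of the two documents differs.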

Issue Info:
  • Year: 2015
  • Volume: 3
  • Issue: 4
  • Pages: 216-223
Measures:
  • Citations: 0
  • Views: 379
  • Downloads: 157
Abstract: 

Finding similar web content is highly useful in the academic community and in software systems. The literature offers many methods and metrics for measuring the extent of text similarity among documents, with applications especially in plagiarism detection systems. However, most of them do not take into account the ambiguity inherent in word or text-pair comparisons, as captured from linguistic experts, or structural features. As a result, previous methods were not accurate enough to deal with vague information. Using structural features and considering the ambiguity inherent in words therefore improves the identification of similar content. In this paper, a new method is proposed that takes lexical and structural features into consideration in text similarity measures. After preprocessing and stop-word removal, each text is divided into general words and domain-specific knowledge words. For each part, appropriate features and measures are extracted. Then, two fuzzy inference systems, one lexical and one structural, are designed to assess lexical and structural text similarity, respectively. The proposed method was evaluated on Persian paper abstracts from the International Conference on e-Learning and e-Teaching (ICELET) corpus. The results show that the proposed method achieves a precision of 75% and detects 81% of the similar cases.

Author(s): Hajipoor O. | SADIDPOUR S.S.

Issue Info:
  • Year: 2020
  • Volume: 8
  • Issue: 2 (30)
  • Pages: 105-114
Measures:
  • Citations: 0
  • Views: 1083
  • Downloads: 0
Abstract: 

With the growing number of Persian electronic documents and texts, quick and inexpensive methods for accessing desired texts in these extensive collections are becoming more important. One effective technique for this purpose is extracting the keywords that represent the main concept of a text. The frequency of a word in the text alone is not a proper indication of its significance. Moreover, most keyword extraction methods ignore the concept and semantics of the text. In addition, the unstructured nature of new texts in news and electronic documents makes these words difficult to extract. This paper proposes an automated, unsupervised method for extracting keywords from Persian texts that lack a proper structure. The method not only takes into account the probability of occurrence of a word and its frequency in the text, but also captures the concept and semantics of the text by training a word2vec model on it. In the proposed method, which combines statistical and machine learning techniques, a word2vec model is first trained on the text, and the words that have the smallest distance to other words are extracted. Then, a statistical equation is proposed to calculate the score of each extracted word using co-occurrence and frequency. Finally, the words with the highest scores are selected as the keywords. The evaluations indicate that the method achieves an F-measure of 53.92%, which is 11% better than other methods.
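The candidate-selection step, picking the words with the smallest distance to other words, could be sketched as below. This assumes the word vectors have already been learned (the abstract uses word2vec; here a hypothetical dictionary `vectors` of non-zero vectors stands in), and it assumes cosine distance as the metric. The abstract's scoring equation over co-occurrence and frequency is not given, so it is not reproduced.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_central_words(vectors, top_n):
    """Rank words by mean distance to all other words, smallest first."""
    words = list(vectors)
    if len(words) < 2:
        return words[:top_n]

    def mean_distance(word):
        others = [o for o in words if o != word]
        total = sum(cosine_distance(vectors[word], vectors[o]) for o in others)
        return total / len(others)

    return sorted(words, key=mean_distance)[:top_n]
```

Words returned by `most_central_words` would then be candidates for a separate scoring step before the final keywords are chosen.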

Issue Info:
  • Year: 2008
  • Volume: 30
  • Issue: 3
  • Pages: 37-40
Measures:
  • Citations: 1
  • Views: 1048
  • Downloads: 0
Abstract: 

Background and Objectives: Many studies have been done to understand the nature and mechanisms of verbal short-term memory. These studies have led to linguistic and nonlinguistic approaches to it. The phonological similarity effect, an important finding of these studies, has increased the conflict between the two approaches. Given the differences between languages, cross-language investigations may be helpful. The aim of this study is to investigate the effect of phonological similarity on the span of verbal short-term memory in the Persian language.

Material and Methods: In this descriptive analytic study, 16 graduate and postgraduate students (mean age 20 years, SD = 2.03; 4 males, the remainder females) participated. All participants were monolingual native Persian speakers without any speech or hearing disorders. Stimuli were 450 words categorized into 3 lists: a rhyming words list, an alliterative words list, and a dissimilar words list. Each list consisted of twenty-five 6-word sequences (150 words in each list). Stimuli were presented via a speaker, with a 1-second interval between the words in each sequence. Three seconds after each sequence was presented, a signal was played as a cue to start recall.

Results: A one-way ANOVA test showed a significant difference between rhyming, alliterative, and dissimilar words (p = 0.0000). A post hoc Tukey test showed a significant difference between the rhyming list and the dissimilar list (0.000), and also between the alliterative list and the dissimilar list (0.006). There was no difference between the rhyming and alliterative lists.

Conclusion: These data suggest that in rhyming and alliterative words, the vowel, because of its higher sonority compared with other phonemes, enhances the memory span as a cueing feature. Cross-language differences, especially in phoneme sonority level, may cause different phonological similarity effects among languages. Since verbal short-term memory is sensitive to the vowels in words, it seems that verbal short-term memory has a linguistic nature.

Author(s): ZHANG W.

Issue Info:
  • Year: 2012
  • Volume: 3
  • Issue: 2
  • Pages: 1-25
Measures:
  • Citations: 1
  • Views: 133
  • Downloads: 0
Keywords: 
Abstract: 

Author(s): 

Issue Info:
  • Year: 2023
  • Volume: 45
  • Issue: 3
  • Pages: 3097-3113
Measures:
  • Citations: 1
  • Views: 12
  • Downloads: 0
Keywords: 
Abstract: 

Issue Info:
  • Year: 2017
  • Volume: 3
Measures:
  • Views: 190
  • Downloads: 0
Abstract: 

Most classification algorithms have been devised to classify long texts, such as emails and web pages, which has overshadowed their effectiveness on short and sometimes informal texts. In this paper, we evaluate the accuracy of four major classification algorithms on Persian short texts: Naïve Bayes, k-nearest neighbors (KNN), decision trees, and support vector machines. First, we briefly introduce each method and provide some basic information, and then we apply the algorithms to one specific dataset to measure their effectiveness. The results show that the Naïve Bayes algorithm performs comparatively better than the others, while the KNN algorithm has the lowest accuracy.
